% LaTeX template for the seminar "Wissenschaftliches Arbeiten" 182.697
% Uses IEEEtran style, adapted from bare_conf.tex
%
% v0.1  U. Schmid  26.9.2014    Initial version

% A few useful macros

\newcommand{\zitat}[1]{\lqq \emph{#1}\rqq}
\newcommand{\lqq}{\lq\lq}
\newcommand{\rqq}{\rq\rq}

%% bare_conf.tex
%% V1.3
%% 2007/01/11
%% by Michael Shell
%% See:
%% http://www.michaelshell.org/
%% for current contact information.
%%
%% This is a skeleton file demonstrating the use of IEEEtran.cls
%% (requires IEEEtran.cls version 1.7 or later) with an IEEE conference paper.
%%
%% Support sites:
%% http://www.michaelshell.org/tex/ieeetran/
%% http://www.ctan.org/tex-archive/macros/latex/contrib/IEEEtran/
%% and
%% http://www.ieee.org/

\documentclass[conference,a4paper]{IEEEtran}

% *** GRAPHICS RELATED PACKAGES ***
%
\ifCLASSINFOpdf
  \usepackage[pdftex]{graphicx}
  % declare the path(s) where your graphic files are
  % \graphicspath{{../pdf/}{../jpeg/}}
  % and their extensions so you won't have to specify these with
  % every instance of \includegraphics
  % \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
\else
  % or other class option (dvipsone, dvipdf, if not using dvips). graphicx
  % will default to the driver specified in the system graphics.cfg if no
  % driver is specified.
  % \usepackage[dvips]{graphicx}
  % declare the path(s) where your graphic files are
  % \graphicspath{{../eps/}}
  % and their extensions so you won't have to specify these with
  % every instance of \includegraphics
  % \DeclareGraphicsExtensions{.eps}
\fi

% *** PDF, URL AND HYPERLINK PACKAGES ***
%
\usepackage{url}
% url.sty was written by Donald Arseneau. It provides better support for
% handling and breaking URLs. url.sty is already installed on most LaTeX
% systems. The latest version can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/misc/
% Read the url.sty source comments for usage information. Basically,
% \url{my_url_here}.

% *** Do not adjust lengths that control margins, column widths, etc. ***
% *** Do not use packages that alter fonts (such as pslatex).         ***
% There should be no need to do such things with IEEEtran.cls V1.6 and later.
% (Unless specifically asked to do so by the journal or conference you plan
% to submit to, of course. )
\usepackage{newfloat}
\usepackage[dvipsnames]{xcolor}
\usepackage{hyperref}
\hypersetup{
    colorlinks=true,
    linkcolor=RedViolet,
    citecolor=RoyalPurple
}

\DeclareFloatingEnvironment[fileext=lod]{diagram}

% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}

\begin{document}
%
% paper title
% can use linebreaks \\ within to get better formatting as desired
\title{Accelerator Architectures Based on Highly Parallel FPGAs: A Survey}

% author names and affiliations
% use a multiple column layout for up to three different
% affiliations
\author{\IEEEauthorblockN{name lastname}
\IEEEauthorblockA{matriculation number: matriculation number\\
study code: Computer Engineering\\
university\\
Email: \textcolor{NavyBlue}{email}}}

\maketitle

\begin{abstract}

Field-Programmable Gate Arrays (FPGAs) have shown great promise as accelerators for a variety of applications.
Compared to more specialized architectures such as graphics processing units (GPUs), FPGAs can exploit fine-grained parallelism because of their flexible, less fixed architecture.
A couple of attempts at accelerating applications by using multiple FPGAs have been made in the past.
This paradigm offers unique possibilities and can obtain even higher performance, although communication among the FPGAs and the partitioning of the task pose a crucial challenge.
This literature survey aims to give a broad overview of the problems and solutions of accelerator architectures based on highly parallel FPGAs, as well as their properties and the results.

\end{abstract}

\section{Introduction}
\label{sec:introduction}

A wealth of papers attempt to optimize specific problems by means of acceleration platforms based on Field-Programmable Gate Arrays (FPGAs), as will become apparent throughout this literature survey \cite{kalm16}.
This optimization can refer to different metrics, as FPGAs have been used to decrease the latency of specific computationally intensive operations, to increase throughput or to reduce energy costs.
Depending on the specific use-case, such optimizations are sometimes optional (although nice to have), but there might even be use-cases where FPGA acceleration serves to address non-optional problems.
These include but are not limited to systems where the locality of sensors and actuators requires devices that physically interface to them \cite{jone99}.
To mention just one more possible aspect of this nature, the use of clusters of off-the-shelf FPGA boards can reduce costs \cite{kalm16} and can be used to provide enormous computing capacities in large data centers \cite{knod13}.

The specific purpose of this literature survey is not to examine acceleration platforms built around a single FPGA, but to concentrate on the use of multiple FPGAs in a distributed architecture.
This is partly a matter of novelty, as there are fewer papers in this space, but more so due to the unique problems and benefits that acceleration platforms based on multiple FPGAs afford.
We will also attempt to examine what the focus of such platforms is in terms of the aforementioned optimization metrics and how their implementation serves to address them.

The structure of the literature survey is as follows: we will briefly explain \hyperref[sec:concepts]{terms and concepts}, give an \hyperref[sec:classification]{overview of the classification scheme} and then dive into the categories of classification (\hyperref[sec:networking]{networking}, \hyperref[sec:partitioning]{problem partitioning} and \hyperref[sec:applications]{applications}). Subsequently, we will examine some of the results of the various acceleration schemes and give a short conclusion.

\section{Terms and Concepts}
\label{sec:concepts}

\subsection{multiple FPGA-based acceleration platforms (MFAPs)}

Throughout this survey paper we will refer to multiple FPGA-based acceleration platforms (MFAPs).
Specifically, MFAPs are platforms that make use of multiple FPGAs for acceleration purposes but are often complemented by additional components that are either necessary or useful towards the use-case of the MFAP.
They can be a part of an architecture that includes general purpose computing elements, such as CPUs and GPUs; network elements such as Ethernet switches and other similar components.
Depending on the use-case and architecture, FPGA resources are coupled with those of other components in various configurations \cite{knod13}.
An example of a CPU+FPGA heterogeneous platform that is sometimes used in conjunction with the framework that will be outlined in the next section is the Xilinx Zynq SoC (ARM+FPGA) \cite{nesh15}.

\subsection{Apache Hadoop and MapReduce}

Although not all architectures make use of the \textit{Apache Hadoop}\footnote{can be found at \url{https://hadoop.apache.org/}, accessed 3rd November 2021} platform, it has become an industry standard for big data analytics \cite{alha15, du19}.
The framework represents an open-source Java based software implementation of \textit{MapReduce}, a programming model developed by Google.
\textit{Apache Hadoop} incorporates a distributed file system (HDFS) with fault-tolerance and benchmark primitives that add to its attractiveness for use in the space of distributed computing \cite{nesh15}.

To help understand some of the architectures discussed later, we will give a rough outline of the \textit{MapReduce} paradigm based on the work of \cite{nesh15}.
The framework receives data which is redirected to a map stage.
Depending on whether there is more than one mapper, the incoming data has to be split accordingly.
The mappers then manipulate the incoming data, which is supplied and emitted in the form of \texttt{<key, value>} pairs, such that the resulting pairs can be merged into groups that are now ready to be distributed for parallel processing \cite{nesh15}.
The aforementioned input data is stored in the HDFS.
Because the data can grow quite large, compression is used when data is passed around.
If multiple mappers are provided, the data is divided in a way that ensures fault tolerance and is usually assigned based on physical proximity \cite{alha15}.
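The split/map/group/reduce flow just described can be sketched in plain Python, with word counting as the stand-in workload (function names are illustrative and not part of Hadoop's API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting <key, value> pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def reduce_phase(pairs, reducer):
    """Group pairs by key and reduce each group; groups could run in parallel."""
    pairs.sort(key=itemgetter(0))
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))}

# Word count, the canonical MapReduce example.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    return sum(values)

result = reduce_phase(map_phase(["a b a", "b c"], mapper), reducer)
# result == {"a": 2, "b": 2, "c": 1}
```

In the real framework, the grouped pairs would be shipped to different nodes before the reduce step; here both phases run in one process purely to show the dataflow.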

\section{Classification of MFAPs}
\label{sec:classification}

Before diving into the many facets of MFAPs, we would do ourselves a good service by organizing the kinds of solutions that are about to be outlined. The list in diagram \ref{list:classification} highlights, in a tree-like list, different aspects that have to be taken into consideration when designing such a platform. The most indented elements in this list represent actual choices, while their parents make up categories that classify those choices. This list of choices is not exhaustive; it merely serves to highlight choices made in the literature reviewed in the context of this survey.

\begin{diagram}
    \centering
    \begin{itemize}    
        \item classification
        \begin{itemize}
            \item networking
            \begin{itemize}
                \item technology
                \begin{itemize}
                    \item pre-existing/standard IP cores (Ethernet, WiFi, PCI/PCIe)
                    \item custom (\textit{BlueLink} \cite{kalm16}, Bridge FIFO \cite{prit20}, Postmaster Direct Memory Access (DMA) \cite{prit20})
                \end{itemize}
                \item topology
                \begin{itemize}
                    \item mesh (3D mesh \cite{prit20})
                    \item hierarchical/tree (cell-routing hubs \cite{chun95})
                    \item ring (cell-routing hubs \cite{chun95})
                \end{itemize}
                \item semantics
                \begin{itemize}
                    \item direct routing
                    \item broadcast
                    \item multicast
                    \item anycast
                \end{itemize}
            \end{itemize}
    
            \item problem/application partitioning
            \begin{itemize}
                \item problem/application mapping
            \end{itemize}

            \item applications
            \begin{itemize}
                \item computational model
                \begin{itemize}
                    \item loose coupling (\textit{MapReduce})
                    \item tight coupling
                \end{itemize}
            \end{itemize}
        \end{itemize}
    \end{itemize}
    \caption{various classification properties of MFAPs and their examples}
    \label{list:classification}
\end{diagram}

We will consider in more detail the problem of networking as well as problem partitioning and discuss applications as well as results of the various attempts at designing MFAPs.

\section{Networking}
\label{sec:networking}

Networking in MFAPs is concerned with the way data is passed around when multiple FPGAs are used to solve a particular problem. Communication within an MFAP was placed under this broader classification category to group the various networking technologies, communication protocols, topologies and communication semantics.
MFAPs can have quite diverse configurations that combine different technologies, topologies and semantics.
This diversity stems in part from the fact that FPGAs are sometimes integrated into broader architectures in order to use them for the acceleration of common computations that could be done on a regular computer, like for example in \cite{asse21}.

\subsection{on the multiplicity of technologies and semantics}

The designs of MFAPs are sometimes not even limited to one technology or one kind of semantics.
The \textit{IBM neural computer} that is the work of P. Narayanan et al. \cite{prit20} offers multiple different virtual or logical interfaces the processors and FPGAs can use as a communications network.
Their work demonstrates their initial claim that multiple virtual channels can be designed to sit atop the underlying packet router logic.
In their case, a standard Internal Ethernet, Postmaster DMA and Bridge FIFO are implemented.
The underlying hardware features a 3D mesh network with a packet routing scheme that allows for directed routing and broadcasting \cite{prit20}.

Internal Ethernet is a virtual interface that is designed to appear similar to an actual Ethernet interface.
This has interesting advantages but also drawbacks.
It is argued that this allows many standard applications for the IP network to be used, but with an overhead that can be attributed to the TCP/IP stack \cite{prit20}.

Postmaster DMA is a mechanism intended for the transfer of data between nodes (FPGA boards). The specific semantics of this communication scheme is that the nodes send out data that can either be consumed at the target node or written directly to a mapped memory region at the target node. What sets Postmaster DMA apart from Bridge FIFO is the feature of writing to mapped memory regions with little overhead, as well as its queue semantics instead of FIFO semantics \cite{prit20}.

\subsection{networking technologies}

The network technology that is used (also sometimes referred to as the interconnection type) is a challenge for FPGA cluster designs, as argued by L. Kalms and D. Göhringer \cite{kalm16}.
Their work is concerned less with solving a specific problem using MFAPs than with a general way to tackle clustering, application distribution and workload balancing.
They also comment that Ethernet and PCIe are the most common approaches to connecting FPGA boards.
This makes sense, because standard communication protocols like these have IP cores already developed and commercially available from vendors \cite{theo14}.
Additionally, switches and cabling, A. T. Markettos et al. \cite{theo14} argue, are commodity items.
Lastly, these standard communication protocols are well understood and convenient.
But they also refer to the difficulties:

\begin{itemize}
    \item \textit{configuration constraints} (inappropriate available parameters)
    \item \textit{fitting requirements} (standard requiring specific clock frequencies, PLLs or clock routing)
    \item \textit{bonded links} (not enough lanes, unsuitable placement, or skew)
    \item \textit{manufacturer specificity} (standards not supported by other manufacturers)
    \item \textit{FPGA support} (cores supporting only some FPGA families)
    \item \textit{licensing} (prohibitive costs)
\end{itemize}

L. Kalms and D. Göhringer \cite{kalm16} make a case for a custom-built solution under the name of \textit{BlueLink}, a \textit{"lightweight pluggable interconnect library"} \cite[p. 106]{kalm16}, just as its creators A. T. Markettos et al. do.
In fact, the latter team even goes as far as to say that \textit{"whilst soft cores for standard protocols (Ethernet, RapidIO, Infiniband, Interlaken) are a boon for FPGA-to-other-system interconnect, we argue that they are inefficient and unnecessary for FPGA-to-FPGA interconnect"} \cite[p. 1]{theo14}.
In the case of \textit{BlueLink}, several properties were identified as requirements for the applications \textit{BlueLink} might be used for: \textit{small message sizes}, \textit{low latency}, \textit{reliable}, \textit{hardware-only}, \textit{lightweight}, \textit{ubiquitous} and \textit{interoperable}.

\subsection{networking topologies}

The idea behind MFAPs was being pursued as early as 1995, as the work of Chun-Chao Yeh, Chun-Hsing Wu and Jie-Yong Juang \cite{chun95} demonstrates.
They, like others too, were confronted with designing a network that is suitable for the purpose of an MFAP.

Their specific considerations concerning the interconnection network architecture led to an interesting design decision: \textit{"[partition] the system into several clusters interconnected by an external network composed of cell routing hubs"} \cite[p. 56]{chun95}.
With this design decision, the overlaid network becomes an abstraction that the individual computation nodes are no longer responsible for.

This is reminiscent of classical networking, where components such as Layer 2 switches allow the network to operate without imposing additional duties on the end nodes.
In contrast, P. Narayanan et al. \cite{prit20}, for example, demonstrate an architecture in which the individual nodes facilitate the operation of the network.
They are not alone: the \textit{BlueLink} system by A. T. Markettos et al. \cite{theo14} uses a similar hop-by-hop routing scheme (as they call it) for communication.
Both paradigms are still fashionable, as several newer works by C.-C. Chung et al. \cite{chun15, chun17} and A. Alhamali et al. \cite{alha15} show.

Which of these two paradigms is better is likely situational and depends at least on the technology that is being used.
In any case, FPGAs are large enough for a standard Ethernet IP core to be included, as some of the papers show, eliminating the need for external routing and switching devices.
One could even design an architecture with sub-clusters such that only one of the nodes needs this IP core, as is the case in the \textit{IBM neural computer} \cite{prit20}.
These papers at least indicate that an externally overlaid network is popular when standard Ethernet is being used, while an internally overlaid network makes for a good alternative for custom solutions.

To give a more concrete example of what is possible, we will briefly look into the detailed topology of the \textit{IBM neural computer (INC)}.
The individual FPGAs are arranged in a logical 3x3x3 3D mesh per INC card.
The actual placement of nodes \textit{"minimizes the connection lengths between logically adjacent nodes"} \cite[p. 3]{prit20}.
Only one node supports Ethernet, but up to two nodes support PCIe 2.0 connections (possibly to a host PC).
Backplanes were designed to fit 16 INC cards and to be arranged in a 12x12x3 mesh.
The configuration that is shown features 416 FPGAs.
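To make the mesh addressing concrete, here is a small sketch (our illustration, assuming simple wrap-free coordinates; the details of the INC routing logic may differ) that enumerates the logical neighbors of a node and counts routing hops in a 3x3x3 mesh:

```python
def mesh_neighbors(node, dims=(3, 3, 3)):
    """Logical neighbors of `node` (x, y, z) in a wrap-free 3D mesh."""
    x, y, z = node
    candidates = [(x + dx, y + dy, z + dz)
                  for dx, dy, dz in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                                     (0, -1, 0), (0, 0, 1), (0, 0, -1)]]
    return [(a, b, c) for a, b, c in candidates
            if 0 <= a < dims[0] and 0 <= b < dims[1] and 0 <= c < dims[2]]

def hop_count(src, dst):
    """Minimum hops under dimension-ordered routing (Manhattan distance)."""
    return sum(abs(s - d) for s, d in zip(src, dst))

# A corner node has 3 neighbors; the center node (1, 1, 1) has 6.
assert len(mesh_neighbors((0, 0, 0))) == 3
assert len(mesh_neighbors((1, 1, 1))) == 6
# Opposite corners of the 3x3x3 mesh are 6 hops apart.
assert hop_count((0, 0, 0), (2, 2, 2)) == 6
```

The physical placement described above then tries to keep these logical single-hop pairs physically close.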

\subsection{networking semantics}

Network semantics is interrelated with the other aspects of networking, just as technology and topology already were.
We will not spend too much time on observing detailed individual differences between existing architectures but instead mention what network semantics refers to and give some examples.
Depending on the dataflow in networks for MFAPs, an implementation with one specific semantic construct for communication will suffice.
The two major types of semantics are direct routing and broadcasting.
Two other less prominent types of semantics are multicast and anycast.
In \cite{chun95}, which used an externally overlaid network that does not make use of standardized communication protocols such as Ethernet, there was a facility for multicast addressing, but anycast addressing is not mentioned.
Systems that do make use of something like Ethernet gain access to all facilities provided by it.
Systems or frameworks that implement custom technologies can choose what to implement.

\section{problem/application partitioning}
\label{sec:partitioning}

All MFAPs have to deal with the issue of problem partitioning (sometimes called application partitioning).
This makes sense: whenever there are multiple nodes available for computation, there has to be some way to generate an assignment of computation requests to said nodes.

The architecture by C.-C. Chung et al. \cite{chun15} uses a partitioning that is very specific to the problem being solved, namely Big Data matrix processing.
In their MFAP, one hundred 512x512 floating point matrix multiplications are computed.
The workload is split amongst different nodes, where it gets processed by the hardware implementation realized through the FPGA.
Problem partitioning takes place even at this node-level stage:
each FPGA implements four matrix processors (up to 16 are possible in their design) and a processor master. The matrices to be multiplied are split into smaller ones so that processing can take place in parallel.
The number of matrix processors poses a trade-off between FPGA resources and execution time.
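The block-wise splitting can be sketched in software as follows (a generic Python illustration of the idea, not the authors' hardware design; each block product below is independent work that one of the matrix processors could compute in parallel):

```python
def split_blocks(m, k):
    """Split a square matrix (list of lists) into a dict of k x k blocks."""
    b = len(m) // k
    return {(i, j): [row[j*b:(j+1)*b] for row in m[i*b:(i+1)*b]]
            for i in range(k) for j in range(k)}

def matmul(a, b):
    """Plain matrix product of two (possibly small block) matrices."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block_matmul(A, B, k=2):
    """C block (i, j) = sum over t of A block (i, t) @ B block (t, j);
    every such block product is an independent unit of parallel work."""
    a, b = split_blocks(A, k), split_blocks(B, k)
    n = len(A)
    blk = n // k
    C = [[0] * n for _ in range(n)]
    for i in range(k):
        for j in range(k):
            acc = [[0] * blk for _ in range(blk)]
            for t in range(k):
                acc = add(acc, matmul(a[(i, t)], b[(t, j)]))
            for r in range(blk):
                for c in range(blk):
                    C[i*blk + r][j*blk + c] = acc[r][c]
    return C

# The blocked result matches a direct multiplication.
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
assert block_matmul(A, A) == matmul(A, A)
```

With k matrix processors per FPGA, the k*k*k block products in the inner loops are what gets distributed, which is exactly where the resource/runtime trade-off mentioned above comes from.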

Other work by M. Owaida and G. Alonso \cite{owai18} tackles problem partitioning more head-on: a general approach to it is their novel contribution.
In their case, inference over decision tree ensembles is used to highlight the nuances of problem partitioning.
\textit{Inference problems} come from the space of machine learning and represent resource-demanding operations, going by the words of M. Owaida and G. Alonso.
They also identify general paradigms by which problems can be partitioned: distribution from a central master node and broadcasting from a central node.
They additionally identify two complications: depending on the nature of the problem and solution, intermediate results might have to be passed around between FPGAs, and/or partial results have to be aggregated and written back to the application software.
Before diving into their intricate problem partitioning proposal, their architecture in brief: it is composed of a network of nodes with two different modes of communication, inter-FPGA and inter-CPU.
Each FPGA implements an inference engine and is coupled with a CPU.

In their work \cite{owai18}, multiple forms of problem partitioning are considered depending on the dimensionality of the data, as well as on whether the problem is compute-bound or network-bound.
In the first mapping scenario, the tree ensemble for the inference does not fit on a single FPGA, so the ensemble needs to be partitioned across multiple FPGAs and the partial computations of the individual FPGAs later recombined.
In the second mapping scenario, the tree ensemble does fit on a single FPGA, but the problem is compute-bound.
First the tree ensemble is broadcast to all nodes, then the data for which inference should be computed is partitioned.
The results are then passed back.
The third mapping scenario then tackles instances in which the problem is network-bound.
This bound refers to inter-FPGA communication, so naturally inter-FPGA communication is avoided.
It is argued that such a model makes sense if the data is already distributed, like it would with a distributed database.
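The three mapping scenarios can be restated schematically as a decision rule (our paraphrase of \cite{owai18}, not code from the paper):

```python
def choose_mapping(fits_single_fpga, network_bound):
    """Schematic restatement of the three mapping scenarios for tree-ensemble
    inference (paraphrased; the real choice involves more considerations)."""
    if not fits_single_fpga:
        # Scenario 1: partition the tree ensemble across FPGAs and
        # recombine the partial results afterwards.
        return "partition-model"
    if network_bound:
        # Scenario 3: keep computation where the data already lives
        # (e.g. a distributed database), avoiding inter-FPGA traffic.
        return "data-local"
    # Scenario 2 (compute-bound): broadcast the ensemble, partition the data.
    return "broadcast-model"

assert choose_mapping(False, False) == "partition-model"
assert choose_mapping(True, True) == "data-local"
assert choose_mapping(True, False) == "broadcast-model"
```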

\subsection{problem/application mapping}

Sometimes extra steps are either needed or sensible when partitioning problems for MFAPs.
One such case is made by G. Fiscaletti et al. \cite{fisc20}, whose work focuses on implementing a framework called \textit{FINN}\footnote{can be found at \url{https://xilinx.github.io/finn/}, accessed 5th November 2021} in a distributed context.
The framework represents a, in the words of G. Fiscaletti et al., \textit{"state-of-the-art framework for building Binarized Neural Network (BNN) accelerators on FPGAs"} \cite[p. 975]{fisc20}.
The point is to reduce floating point operations (which are typical for neural network applications) to low-precision arithmetic operations.
The specific contribution of their work is addressing the problem of how to \textit{"manage the ever-increasing size of BNNs architectures by exploiting distributed systems with multiple processing elements"} \cite[p. 976]{fisc20}.
Arguably, all MFAPs based on the \textit{MapReduce} computational model make use of problem mapping, since there is a map step by definition.
How those MFAPs roughly operate and what problems they solve will, however, be discussed in the next section.

\section{Applications of MFAPs}
\label{sec:applications}

The purpose of this section is to go over the concrete use-case of every MFAP that is part of this literature survey, if applicable. In the case of general frameworks for the facilitation of MFAP problem solving, possible applications will be cited.
After already having looked in great detail at networking and problem partitioning considerations, we should be well equipped for the examination of such applications.

\subsection{computational model}

A. T. Markettos et al. \cite{theo14} classify applications into two kinds: \textit{loosely coupled} and \textit{tightly coupled} applications.
This distinction is made based on whether the FPGAs need to communicate with each other when processing partitioned problem workloads.
MFAPs where FPGAs do not need to communicate with each other are considered \textit{loosely coupled}.
That was already a point of relevance when we discussed the work of M. Owaida and G. Alonso \cite{owai18}, but this distinction is further made useful by their examples for systems pertaining to both groups.

\subsection{loosely coupled applications}

All the MFAPs that use \textit{MapReduce} are loosely coupled \cite{theo14}. We will go over a couple of MFAPs that use \textit{MapReduce} (typically employed via the use of the \textit{Apache Hadoop} framework).

A. Alhamali et al. \cite{alha15} have designed an MFAP that is used to accelerate deep learning computations (using \textit{Apache Hadoop}).
The part of the deep learning process being accelerated consists of the convolutional layers of the neural network.
Their explanation of these convolutional layers shall serve as a guide as to what is happening: \textit{"convolutional neural networks (CNN) deep learning architectures employ a set of convolutional layers that recursively extract features from the input space before forwarding them to a multi-layer perceptron (MLP) circuit which acts as the system classifier"} \cite[p. 565]{alha15}.
They employ a variation of a previously proposed technique under the name of \textit{Parallel Stochastic Gradient Descent} (Parallel SGD).
The concrete algorithm is to train separate models of the CNN and combine all of them by averaging the weights of those resulting CNNs \cite{alha15}.
An important point is brought up in their rationale: \textit{"model computation ... is highly data-parallel, which is ideal for the map-reduce framework"} \cite[p. 569]{alha15}.
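The combine step of Parallel SGD can be sketched in a few lines (a toy illustration with flat weight vectors; the real system averages the full parameter tensors of each independently trained CNN replica):

```python
def average_weights(models):
    """Combine independently trained model replicas by element-wise
    averaging of their weights, as in Parallel SGD."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

# Three replicas trained on separate data partitions (toy weight vectors).
replicas = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
combined = average_weights(replicas)
# combined == [2.0, 4.0]
```

Because each replica trains without talking to the others, the map stage carries the training and the reduce stage is only this cheap averaging, which is what makes the scheme loosely coupled.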

K. Neshatpour et al. \cite{nesh15} have done similar work (using \textit{Apache Hadoop}).
Their task was to accelerate machine learning kernels using an MFAP.
Kernels in this context refer to computations used in machine learning paradigms that have a high computational cost (or as they are referred to in their work, \textit{hotspot functions} \cite{nesh15}).
Four specific kernels were examined: K-means clustering, KNN classification, SVM-learn and Naive Bayes classification.
They were not the only ones using \textit{Apache Hadoop} to accelerate a K-means clustering algorithm; similar work was also done by C.-C. Chung et al. \cite{chun17}.
Both architectures comprise a network switch through which a master node communicates with slave nodes.
In both cases the nodes are heterogeneous CPU+FPGA computing platforms, although the former uses ARM CPUs while the latter uses Intel i7 CPUs.
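To illustrate why K-means fits the \textit{MapReduce} model, here is a toy one-dimensional sketch (our illustration, not either group's implementation): the map step emits \texttt{<nearest centroid, point>} pairs and the reduce step averages each group:

```python
def kmeans_step(points, centroids):
    """One MapReduce-style K-means iteration on 1-D data."""
    # Map: emit <index of nearest centroid, point> pairs.
    pairs = [(min(range(len(centroids)), key=lambda i: abs(p - centroids[i])), p)
             for p in points]
    # Reduce: average the points assigned to each centroid.
    new = []
    for i in range(len(centroids)):
        assigned = [p for idx, p in pairs if idx == i]
        new.append(sum(assigned) / len(assigned) if assigned else centroids[i])
    return new

points = [1.0, 2.0, 9.0, 10.0]
centroids = kmeans_step(points, [0.0, 8.0])
# centroids == [1.5, 9.5]
```

The distance computation in the map step is the data-parallel hotspot that the FPGA kernels in these works accelerate.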

The more recent example of H. Du et al. \cite{du19} departs a bit from previous attempts to utilize MFAPs to accelerate computations.
In their case, the focus is on accelerating a different aspect of the \textit{Apache Hadoop} stack: the compression that takes place when data is passed around.
The framework makes use of the HDFS, as such there is a cost to the transmission of large amounts of data in terms of disk I/O and network transmission \cite{du19}.
As explained in their work, this can be mitigated via the use of compression, but then CPU resources have to be used.
To highlight one of the difficulties of this type of acceleration, we briefly mention the case of timeline parallelism: as compression can be enabled even for outputs of the map or reduce stage, the API has to offer compression and decompression facilities and both can be active on different nodes at the same time \cite{du19}.

Another more recent example is the work of G. Fiscaletti et al. \cite{fisc20}, which is similarly to \cite{alha15} concerned with the task of accelerating convolutional neural networks (CNNs).
However, their approach does not make use of the \textit{MapReduce} paradigm or the \textit{Apache Hadoop} framework.
Their novel contribution was already mentioned in the segment about problem mapping: mapping the operations typically used in CNNs to binarized neural networks (BNNs), in the context of an MFAP.

A case can be made that the work of A. Asseman et al. \cite{asse21} belongs to this category as well.
Their work builds upon the \textit{IBM neural computer} that has been explored in previous segments.
In their case, a specialized form of machine learning called deep neuroevolution is used in the context of an MFAP, an optimization technique that, unlike gradient descent, is derivative-free and thus fit for a distributed architecture \cite{asse21}.
The specific problems deep learning is used to solve in their work are video games (Atari 2600 games), which abstractly model a useful kind of problem where \textit{"an agent learns an optimal behavior by observing and interacting with the environment"} \cite[p. 1]{asse21}.
They explain that \textit{"rather than accelerating the optimization algorithm ... we have taken a different approach and addressed the data generation"} \cite[p. 2]{asse21}.
Here they are referring to the task of emulating the game and extracting frames for the learning process.
Instead of emulating the games, which were made for the Atari 2600 game console, in software, they took advantage of the FPGAs to emulate the games directly in hardware.
This has created an opportunity to accumulate training frames with a significant speedup.
Because of the semantics of frame generation, this can be considered another loosely coupled application.

\subsection{tightly coupled applications}

Those applications that do not fit the description of a loosely coupled application belong to this class.
A. T. Markettos et al. \cite{theo14} mention the example of gate-level system-on-chip simulations.
A defining property of this application is its latency dependency: \textit{"nodes operating in lock-step require single-cycle interconnect latency"} \cite[p. 2]{theo14}.

\section{Results}
\label{sec:results}

In this section, we want to take a crude look at the results of previously discussed MFAPs.
There is an inherent difficulty in comparing the results of different attempts at accelerating problems using MFAPs, as the varying architectures and types of applications have different metrics that can be measured, added to the complexities of benchmarking in general.
MFAPs that use \textit{Apache Hadoop} compare more favorably in this regard: the framework includes a set of micro-benchmarks, for example for clustering, classifiers and data compression \cite{nesh15}.
Whether a standard benchmark that broadly captures MFAP acceleration across the varying properties being accelerated is even possible remains an open question.
No work was found that examines this problem.

As a meta-result of the classification scheme used for this survey, not many tightly coupled applications were identified, but a great number of loosely coupled ones.
A possible bias in the selection of papers should not be ruled out, although of course care was taken to give the best possible overview.
We will now turn to the surveyed papers themselves.

The work of A. Asseman et al. \cite{asse21} has seen considerable improvements in speed.
Their attempt at acceleration was concerned with the generation of training frames, so the speedup can either be considered to mean a faster training time for a neural network that was trained with the same amount of frames, or a neural network that was trained with more frames in the same time-span.
The benchmark is a direct comparison with contending neural network architectures for many different games.

Others have used Amdahl's law to compute a concise value for the acceleration, like A. Alhamali et al. \cite{alha15}.
This means that the MFAP is compared to a sequential execution platform for the convolutional layers.
They measured a total speedup of 12.6 times from the two architectural improvements that comprise their contribution, as well as a reduction of around 7.7\% in overall energy consumption, with an estimated possible reduction of up to 87.5\% when all convolutional layers are accelerated \cite{alha15}.
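Such overall figures follow from Amdahl's law; as a generic sketch (the symbols here are illustrative and not taken from \cite{alha15}), accelerating a fraction $p$ of the total execution time by a factor $s$ yields an overall speedup of
\[
S_{\mathrm{overall}} = \frac{1}{(1 - p) + p/s} ,
\]
so even a large per-kernel speedup $s$ remains bounded by the non-accelerated fraction $1 - p$, which is why covering all convolutional layers raises the attainable overall benefit so markedly.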

The big data matrix processing MFAP by C.-C. Chung et al. \cite{chun15} compares the performance of their MFAP to a computing server with an x86 Intel CPU and records a fourfold speedup. The same speedup was reported, also by C.-C. Chung et al. in \cite{chun17}, where their MFAP for K-means clustering is compared against the \textit{Apache Mahout} machine learning libraries.

H. Du et al. \cite{du19} built an MFAP that accelerates compression for the \textit{Apache Hadoop} framework, so they instead examine CPU time spent for the (de-)compression and throughput of the system.
Speedups are reported particularly for write operations, with a speedup ratio of up to 6.28, whereas the speedups for read operations are less pronounced.

The MFAP by K. Neshatpour et al. \cite{nesh15} was designed to accelerate different machine learning kernels, and the reported speedups vary accordingly.
While the K-means algorithm yields an impressive speedup of 94 times in their case, other kernels such as SVM-learn see almost no acceleration.

Works on MFAPs with a less straightforward concept are harder to interpret.
The work by L. Kalms and D. Göhringer \cite{kalm16}, for example, does not consider a concrete application but is instead concerned with a general way to distribute an application by clustering and mapping.
They therefore examine the complexity of their approach and report good scalability.

Lastly, we consider the results of A. T. Markettos et al. \cite{theo14}, who architected the \textit{BlueLink} system.
In their own words, they specifically chose the Stratix V platform, using a Stratix V GX FPGA to compare \textit{BlueLink} against Altera's existing 10\,Gbps Ethernet MAC.
The comparison has many dimensions of interest; for example, they report that 40\,Gbps BlueLink fits in a similar area as the 10\,Gbps Ethernet MAC (10\,Gbps BlueLink also compares favorably, using less area and only 15\% of the memory).
A distinction is made between standards that support reliable transmission and those that do not, although contenders to \textit{BlueLink} are only found among the latter.
A further interesting metric is the throughput of their standard, a performance value that depends on the packet size and the general overhead of transmissions.
Fig. \ref{fig:overhead} is taken from their work and highlights the packet-size-dependent performance in comparison with the 10\,Gbps Ethernet MAC.
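The shape of such curves can be understood with a simple, generic model (not taken from \cite{theo14}): with a fixed per-packet overhead of $H$ bits and a payload of $P$ bits, the achievable fraction of the line rate is
\[
\eta = \frac{P}{P + H} ,
\]
so small packets pay the fixed overhead disproportionately, which matches the packet-size dependence of the measured throughput.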

\begin{figure}[ht]
\begin{center}
\includegraphics[width=\columnwidth]{theo14-overhead.png}
% where an .eps filename suffix will be assumed under latex, 
% and a .pdf suffix will be assumed for pdflatex; or what has been declared
% via \DeclareGraphicsExtensions.
  \caption{Caption reproduced verbatim from \cite{theo14}: Overhead of BlueLink against Ethernet-based standards for small packets. BlueLink makes considerably better use of bandwidth up to 256-bit packets.}
  \label{fig:overhead}
\end{center}
\end{figure}

This figure was chosen as representative of the category of network benchmarks, which A. T. Markettos et al. \cite{theo14} examine in great detail, as the network performance of their solution is integral to it.

\section{Conclusion}
\label{sec:conclusion}

The space of multiple FPGA-based acceleration platforms is populated by solutions to specific applications as well as general frameworks that facilitate such acceleration.
The works examined in this survey generally report favorable acceleration results across different metrics, although this acceleration comes at the cost of complexity.
The authors of all examined designs have given considerable thought to the architecture of such platforms.
More specifically, aspects pertaining to networking, problem partitioning, and the actual application are of considerable importance and deserve this attention.

\bibliographystyle{IEEEtran}
% argument is your BibTeX string definitions and bibliography database(s)
\bibliography{bibfile}

% that's all folks
\end{document}
